Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #17732 +/- ##
============================================
+ Coverage 40.55% 40.78% +0.23%
- Complexity 2574 2610 +36
============================================
Files 5179 5187 +8
Lines 349896 351466 +1570
Branches 44727 44999 +272
============================================
+ Hits 141890 143339 +1449
- Misses 208006 208127 +121 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
luoluoyuyu
left a comment
There was a problem hiding this comment.
Review summary
Reference counting for shared JARs fixes incorrect removal of existedJarToMD5 when only one of several UDFs is dropped. rebuildJarMetadataFromUDFTable() keeps jar metadata consistent after snapshot load. New unit tests cover the shared-jar and snapshot paths.
Two small suggestions below; otherwise this looks good to merge.
| } | ||
|
|
||
| private void addJarReference(String jarName, String jarMD5) { | ||
| existedJarToMD5.putIfAbsent(jarName, jarMD5); |
There was a problem hiding this comment.
putIfAbsent keeps the first MD5 for a jar name but always increments the reference count. That is fine when validate() runs before addUDFInTable, but if addJarReference is ever called without that check, ref count and MD5 could diverge.
Consider rejecting a conflicting MD5 inside addJarReference (same check as in validate() at lines 107-115) so this helper is safe on its own.
| deserializeExistedJarToMD5(fileInputStream); | ||
|
|
||
| udfTable.deserializeUDFTable(fileInputStream); | ||
| rebuildJarMetadataFromUDFTable(); |
There was a problem hiding this comment.
After deserializeExistedJarToMD5, rebuildJarMetadataFromUDFTable() clears both maps and rebuilds from udfTable. That makes udfTable the source of truth on load.
A short comment here explaining that deserialized existedJarToMD5 is intentionally discarded would help future readers.
jt2594838
left a comment
There was a problem hiding this comment.
May unify the jar management of UDF and PIPE (and even more).
| public synchronized void serializeJarNameToMd5AndReferenceCount(final OutputStream outputStream) | ||
| throws IOException { | ||
| ReadWriteIOUtils.write(jarNameToMd5Map.size(), outputStream); | ||
| for (final Map.Entry<String, String> entry : jarNameToMd5Map.entrySet()) { | ||
| final String jarName = entry.getKey(); | ||
| ReadWriteIOUtils.write(jarName, outputStream); | ||
| ReadWriteIOUtils.write(entry.getValue(), outputStream); | ||
| ReadWriteIOUtils.write(jarNameToReferenceCountMap.getOrDefault(jarName, 0), outputStream); | ||
| } | ||
| } |
There was a problem hiding this comment.
Is it possible or necessary to serialize an entry whose count is zero?
There was a problem hiding this comment.
Good point. A zero reference count should not be serialized: under the normal add/remove path, removeReference removes both maps when the count reaches zero. I updated the snapshot logic to serialize only entries with a positive reference count and matching md5 metadata, and to skip non-positive counts when loading snapshots as a defensive cleanup. Added a regression test for the zero-count snapshot case as well.
|



Description
This PR fixes UDF jar metadata handling in
UDFInfowhen multiple UDFs share the same jar.Background
UDFInfopreviously tracked uploaded jars only withjarName -> md5. When a UDF was dropped, the jar metadata wasremoved immediately. This breaks the case where multiple UDFs reference the same jar:
reappear after restart
Changes
This PR introduces reference counting for shared UDF jars in
UDFInfo.existedJarToReferenceCountto track how many UDFs are using each jarThis keeps shared jar metadata consistent across normal create/drop operations and snapshot recovery.
Tests
Updated
UDFInfoTestto cover the shared-jar cases:This PR has:
for an unfamiliar reader.
for code coverage.
Key changed/added classes (or packages if there are too many classes) in this PR